
    Exploiting parallelism in many-core architectures: Lattice Boltzmann models as a test case

    In this paper we address the problem of identifying and exploiting techniques that optimize the performance of large-scale scientific codes on many-core processors. We consider as a test-bed a state-of-the-art Lattice Boltzmann (LB) model that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equations of state of a perfect gas. The regular structure of Lattice Boltzmann algorithms makes it relatively easy to identify a large degree of available parallelism; the challenge is that of mapping this parallelism onto processors whose architecture is becoming more and more complex, both in terms of an increasing number of independent cores and – within each core – of vector instructions on longer and longer data words. We take as an example the Intel Sandy Bridge micro-architecture, which supports AVX instructions operating on 256-bit vectors; we address the problem of efficiently implementing the key computational kernels of LB codes – streaming and collision – on this family of processors; we introduce several successive optimization steps and quantitatively assess the impact of each of them on performance. Our final result is a production-ready code already in use for large-scale simulations of the Rayleigh-Taylor instability. We analyze both raw performance and scaling figures, and compare with GPU-based implementations of similar codes.
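    The two kernels named above can be made concrete with a minimal sketch. The following is a toy one-dimensional, three-population (D1Q3) BGK lattice Boltzmann step in pure Python – a deliberately simplified stand-in, not the D2Q37 model of the paper; all names and parameters here are illustrative:

```python
# Toy D1Q3 lattice Boltzmann update: streaming followed by BGK collision.
# Illustrative only -- the paper's model is a far richer D2Q37 scheme.

N = 8                 # lattice sites (periodic boundaries)
C = [0, 1, -1]        # discrete velocities
W = [2/3, 1/6, 1/6]   # quadrature weights
TAU = 0.8             # BGK relaxation time

def stream(f):
    """Propagate each population along its velocity (periodic wrap)."""
    return [[f[i][(x - C[i]) % N] for x in range(N)] for i in range(3)]

def collide(f):
    """Relax each population toward its local equilibrium."""
    out = [[0.0] * N for _ in range(3)]
    for x in range(N):
        rho = sum(f[i][x] for i in range(3))                 # local density
        u = sum(C[i] * f[i][x] for i in range(3)) / rho      # local velocity
        for i in range(3):
            cu = C[i] * u
            feq = W[i] * rho * (1 + 3*cu + 4.5*cu*cu - 1.5*u*u)
            out[i][x] = f[i][x] - (f[i][x] - feq) / TAU
    return out

# One full update step; mass (the sum of all populations) is conserved.
f = [[W[i] * (1.0 + 0.01 * x) for x in range(N)] for i in range(3)]
mass0 = sum(map(sum, f))
f = collide(stream(f))
assert abs(sum(map(sum, f)) - mass0) < 1e-12
```

    The streaming step is pure data movement (the memory-bound part), while the collision step is pure local arithmetic (the compute-bound part) – the split that drives the vectorization work described in the abstract.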

    The three dimensional Ising spin glass in an external magnetic field: the role of the silent majority

    We perform equilibrium parallel-tempering simulations of the 3D Ising Edwards-Anderson spin glass in a field. A traditional analysis shows no signs of a phase transition. Yet, we encounter dramatic fluctuations in the behaviour of the model: averages over all the data describe the behaviour of only a small fraction of it. Therefore we develop a new approach to study the equilibrium behaviour of the system, by classifying the measurements as a function of a conditioning variate. We propose a finite-size scaling analysis based on the probability distribution function of the conditioning variate, which may accelerate the convergence to the thermodynamic limit. In this way, we find a non-trivial spectrum of behaviours, where a part of the measurements behaves as the average, while the majority of them shows signs of scale invariance. As a result, we can estimate the temperature interval where the phase transition in a field ought to lie, if it exists. Although this would-be critical regime is unreachable with present resources, the numerical challenge is finally well posed.
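    The classification idea can be sketched in a few lines: instead of one global average, measurements are sorted by the conditioning variate and averaged within quantiles of it. The data below are synthetic stand-ins (both variables are hypothetical, not the paper's observables), chosen so that the observable is correlated with the conditioning variate:

```python
import random

random.seed(1)

# Synthetic data: (conditioning variate, observable), with the observable
# correlated with the variate. Purely illustrative, not the paper's data.
samples = [(x, 0.5 * x + random.gauss(0, 1))
           for x in (random.gauss(0, 1) for _ in range(1000))]

def conditional_averages(samples, n_quantiles=10):
    """Average the observable within quantiles of the conditioning variate."""
    ordered = sorted(samples)             # sort by the conditioning variate
    k = len(ordered) // n_quantiles
    return [sum(y for _, y in ordered[q * k:(q + 1) * k]) / k
            for q in range(n_quantiles)]

plain = sum(y for _, y in samples) / len(samples)
cond = conditional_averages(samples)
# The single global average hides the spread across the quantiles:
assert cond[0] < plain < cond[-1]
```

    In the paper's analysis, the quantile-resolved averages are what expose the "silent majority" of scale-invariant measurements that the plain average conceals.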

    Critical parameters of the three-dimensional Ising spin glass

    We report a high-precision finite-size scaling study of the critical behavior of the three-dimensional Ising Edwards-Anderson model (the Ising spin glass). We have thermalized lattices up to L = 40 using the Janus dedicated computer. Our analysis takes into account leading-order corrections to scaling. We obtain Tc = 1.1019(29) for the critical temperature, ν = 2.562(42) for the thermal exponent, η = -0.3900(36) for the anomalous dimension and ω = 1.12(10) for the exponent of the leading corrections to scaling. Standard (hyper)scaling relations yield α = -5.69(13), β = 0.782(10) and γ = 6.13(11). We also compute several universal quantities at Tc.
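    The derived exponents follow from the measured ones through the standard (hyper)scaling relations in d = 3. A quick consistency check on the central values (errors ignored in this sketch):

```python
# Standard (hyper)scaling relations in d = 3, checked against the
# central values quoted in the abstract (error bars ignored).
d = 3
nu, eta = 2.562, -0.3900

alpha = 2 - d * nu               # hyperscaling: alpha = 2 - d*nu
beta = nu * (d - 2 + eta) / 2    # beta = nu*(d - 2 + eta)/2
gamma = nu * (2 - eta)           # Fisher relation: gamma = nu*(2 - eta)

assert abs(alpha - (-5.69)) < 0.01   # quoted: alpha = -5.69(13)
assert abs(beta - 0.782) < 0.01      # quoted: beta = 0.782(10)
assert abs(gamma - 6.13) < 0.01      # quoted: gamma = 6.13(11)
```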

    Janus II: a new generation application-driven computer for spin-system simulations

    This paper describes the architecture, the development and the implementation of Janus II, a new-generation application-driven number cruncher optimized for Monte Carlo simulations of spin systems (mainly spin glasses). This domain of computational physics is a recognized grand challenge of high-performance computing: the resources necessary to study in detail theoretical models that can make contact with experimental data are by far beyond those available using commodity computer systems. On the other hand, several specific features of the associated algorithms suggest that unconventional computer architectures, which can be implemented with available electronics technologies, may lead to order-of-magnitude increases in performance, reducing the time needed to carry out simulation campaigns – which would take centuries on commercially available machines – to values acceptable on a human scale. Janus II is one such machine, recently developed and commissioned, that builds upon and improves on the successful JANUS machine, which has been used for physics since 2008 and is still in operation today. This paper describes in detail the motivations behind the project, the computational requirements, the architecture and the implementation of this new machine, and compares its expected performance with that of currently available commercial systems.
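    The workload Janus II accelerates can be illustrated by a minimal single-spin-flip Metropolis simulation of the 3D Edwards-Anderson model. The sketch below (pure Python, a tiny 4³ lattice, quenched ±J couplings) shows the algorithmic core; the dedicated machine updates enormous numbers of spins per clock cycle in hardware, which this toy code makes no attempt to mimic:

```python
import math
import random

random.seed(0)
L = 4          # linear lattice size, periodic boundaries (tiny toy system)
N = L * L * L
BETA = 0.5     # inverse temperature

def idx(x, y, z):
    """Map 3D coordinates (with periodic wrap) to a site index."""
    return (x % L) + L * ((y % L) + L * (z % L))

# Edwards-Anderson model: quenched random couplings J = +-1 on every bond.
J = {}
for x in range(L):
    for y in range(L):
        for z in range(L):
            i = idx(x, y, z)
            for j in (idx(x + 1, y, z), idx(x, y + 1, z), idx(x, y, z + 1)):
                J[(i, j)] = J[(j, i)] = random.choice((-1, 1))

neigh = [[] for _ in range(N)]
for (i, j) in J:
    neigh[i].append(j)

spins = [random.choice((-1, 1)) for _ in range(N)]

def energy_per_spin():
    """E/N = -(1/N) * sum over bonds of J_ij * s_i * s_j."""
    return -sum(v * spins[i] * spins[j]
                for (i, j), v in J.items() if i < j) / N

def metropolis_sweep():
    """One Metropolis update attempt per site."""
    for i in range(N):
        h = sum(J[(i, j)] * spins[j] for j in neigh[i])   # local field
        dE = 2 * spins[i] * h                             # cost of flipping
        if dE <= 0 or random.random() < math.exp(-BETA * dE):
            spins[i] = -spins[i]

e_start = energy_per_spin()
for _ in range(20):
    metropolis_sweep()
assert energy_per_spin() < e_start   # relaxation from a random start
```

    The per-spin update touches only nearest neighbours and a handful of integer couplings – exactly the locality and bit-level simplicity that make application-driven hardware so effective for this problem.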

    Prospects for K+ → π+νν̄ at CERN in NA62

    The NA62 experiment will begin taking data in 2015. Its primary purpose is a 10% measurement of the branching ratio of the ultra-rare kaon decay K+ → π+νν̄, using the decay in flight of kaons in an unseparated beam with momentum 75 GeV/c. The detector and analysis technique are described here.

    Externalities and the nucleolus

    In most economic applications, externalities prevail: the worth of a coalition depends on how the other players are organized. We show that there is a unique natural way of extending the nucleolus from (coalitional) games without externalities to games with externalities. This is in contrast to the Shapley value and the core, for which many different extensions have been proposed.

    Early Experience on Porting and Running a Lattice Boltzmann Code on the Xeon-Phi Co-Processor

    In this paper we report on our early experience of porting, optimizing and benchmarking a Lattice Boltzmann (LB) code on the Xeon-Phi co-processor, the first generally available version of the new Many Integrated Core (MIC) architecture developed by Intel. We consider as a test-bed a state-of-the-art LB model that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equations of state of a perfect gas. The regular structure of LB algorithms makes it relatively easy to identify a large degree of available parallelism. However, mapping a large fraction of this parallelism onto this new class of processors is not straightforward. The D2Q37 LB algorithm considered in this paper is an appropriate test-bed for this architecture, since its critical computing kernels require high performance both in terms of memory bandwidth for sparse memory access patterns and in terms of number-crunching capability. We describe our implementation of the code, which builds on previous experience gained on other (simpler) many-core processors and GPUs, present benchmark results and measured performance, and finally compare with the results obtained by previous implementations developed on state-of-the-art classic multi-core CPUs and GP-GPUs.

    Benchmarking GPUs with a parallel Lattice-Boltzmann code

    Accelerators are an increasingly common option to boost the performance of codes that require extensive number crunching. In this paper we report on our experience with NVIDIA accelerators to study fluid systems using the Lattice Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for processor architectures with a large degree of parallelism, such as recent multi- and many-core processors and GPUs; however, the challenge of exploiting a large fraction of the theoretically available performance of this new class of processors is not easily met. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a D2Q37 model), that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas. The computational features of this model make it a significant benchmark to analyze the performance of new computational platforms, since critical kernels in this code require both high memory bandwidth on sparse memory addressing patterns and high floating-point throughput. In this paper we consider two recent classes of GPU boards based on the Fermi and Kepler architectures; we describe in detail all steps taken to implement and optimize our LB code and analyze its performance, first on single-GPU systems, and then on parallel multi-GPU systems based on one node as well as on a cluster of many nodes; in the latter case we use CUDA-aware MPI as an abstraction layer to assess the advantages of advanced GPU-to-GPU communication technologies like GPUDirect. In our implementation, the aggregate sustained performance of the most compute-intensive part of the code breaks the 11 double-precision Tflops barrier on a single-host system with two GPUs.
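    The multi-GPU parallelization rests on a standard domain decomposition with halo exchange. The communication pattern can be sketched as follows – here "ranks" are simulated in-process in pure Python with no MPI or CUDA, whereas the real code would move the halos between GPUs with CUDA-aware MPI send/receive calls:

```python
# Domain decomposition with halo exchange: the communication pattern
# behind the multi-GPU runs described above. "Ranks" are simulated
# in-process; a real code would exchange halos via CUDA-aware MPI.

NRANKS = 4
CELLS_PER_RANK = 6

# Each rank owns a 1D slab of cells plus one halo cell on each side.
# Owned cells are initialized with globally unique values for clarity.
domains = [[0.0]
           + [float(r * CELLS_PER_RANK + c) for c in range(CELLS_PER_RANK)]
           + [0.0]
           for r in range(NRANKS)]

def halo_exchange(domains):
    """Copy each rank's boundary cells into its neighbours' halos (periodic)."""
    for r, d in enumerate(domains):
        left = domains[(r - 1) % NRANKS]
        right = domains[(r + 1) % NRANKS]
        d[0] = left[-2]    # my left halo  <- left neighbour's last owned cell
        d[-1] = right[1]   # my right halo <- right neighbour's first owned cell

halo_exchange(domains)
# After the exchange, every rank can apply a local stencil (e.g. the LB
# streaming step) to its owned cells with no further communication.
assert domains[0][0] == domains[-1][-2]   # periodic wrap is in place
```

    Overlapping this exchange with computation on the bulk of each slab is what makes technologies like GPUDirect pay off: the halo traffic moves directly between GPU memories while the interior cells are being updated.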